{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "3o8Qof7Cy165"
   },
   "source": [
    "# Basic Feature Engineering in BQML \n",
    "\n",
    "**Learning Objectives**\n",
    "\n",
    "1. Create SQL statements to evaluate the model\n",
    "2. Extract temporal features\n",
    "3. Perform a feature cross on temporal features\n",
    "\n",
    "\n",
    "## Introduction \n",
    "In this lab, we utilize feature engineering to improve the prediction of the fare amount for a taxi ride in New York City.  We will use BigQuery ML to build a taxifare prediction model, using feature engineering to improve and create a final model.\n",
    "\n",
    "In this Notebook we set up the environment, create the project dataset, create a feature engineering table, create and evaluate a baseline model, extract temporal features, perform a feature cross on temporal features, and evaluate model performance throughout the process. \n",
    "\n",
    "Each learning objective will correspond to a __#TODO__ in the [student lab notebook](https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive2/feature_engineering/labs/1_bqml_basic_feat_eng_bqml-lab.ipynb) -- try to complete that notebook first before reviewing this solution notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "hJ7ByvoXzpVI"
   },
   "source": [
    "### Set up environment variables and load necessary libraries"
   ]
  },
  {
    "cell_type": "code",
    "metadata": {
      "id": "Nny3m465gKkY",
      "colab_type": "code",
      "colab": {}
    },
    "source": [
      "# Installing the latest version of the package\n",
      "!pip install --user google-cloud-bigquery==3.4.1"
    ],
    "execution_count": null,
    "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note**: Restart your kernel to use updated packages."
   ]
  },
  {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Kindly ignore the deprecation warnings and incompatibility errors related to google-cloud-storage."
    ]
   },
   {
     "cell_type": "code",
     "metadata": {
       "id": "Nny3m465gKkY",
       "colab_type": "code",
       "colab": {}
     },
     "source": [
       "# Installing the latest version of the package\n",
       "import tensorflow as tf\n",
       "print(\"TensorFlow version: \",tf.version.VERSION)\n"
     ],
     "execution_count": null,
     "outputs": []
   },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Your current GCP Project Name is: munn-sandbox\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "# Exporting the project\n",
    "\n",
    "export PROJECT=$(gcloud config list project --format \"value(core.project)\")\n",
    "echo \"Your current GCP Project Name is: \"$PROJECT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "L0-vOB4y2BJM"
   },
   "source": [
    "## The source dataset\n",
    "\n",
    "Our dataset is hosted in [BigQuery](https://cloud.google.com/bigquery/). The taxi fare data is a publically available dataset, meaning anyone with a GCP account has access. Click [here](https://console.cloud.google.com/bigquery?project=bigquery-public-data&p=nyc-tlc&d=yellow&t=trips&page=table) to access the dataset.\n",
    "\n",
    "The Taxi Fare dataset is relatively large at 55 million training rows, but simple to understand, with only six features. The fare_amount is the target, the continuous value we’ll train a model to predict.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a BigQuery Dataset\n",
    "\n",
    "A BigQuery dataset is a container for tables, views, and models built with BigQuery ML. Let's create one called __feat_eng__ if we have not already done so in an earlier lab. We'll do the same for a GCS bucket for our project too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Creating BigQuery dataset titled: feat_eng\n",
      "Dataset 'munn-sandbox:feat_eng' successfully created.\n",
      "\\nHere are your current datasets:\n",
      "    datasetId    \n",
      " --------------- \n",
      "  None           \n",
      "  arc_results    \n",
      "  babyweight     \n",
      "  bqml_taxifare  \n",
      "  demo           \n",
      "  demo_sample    \n",
      "  feat_eng       \n",
      "  movielens      \n",
      "  patent_demo    \n",
      "  patents        \n",
      "  stock_market   \n",
      "  taxifare       \n",
      "  test_sample    \n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "\n",
    "# Create a BigQuery dataset for feat_eng if it doesn't exist\n",
    "datasetexists=$(bq ls -d | grep -w feat_eng)\n",
    "\n",
    "if [ -n \"$datasetexists\" ]; then\n",
    "    echo -e \"BigQuery dataset already exists, let's not recreate it.\"\n",
    "\n",
    "else\n",
    "    echo \"Creating BigQuery dataset titled: feat_eng\"\n",
    "    \n",
    "    bq --location=US mk --dataset \\\n",
    "        --description 'Taxi Fare' \\\n",
    "        $PROJECT:feat_eng\n",
    "   echo \"\\nHere are your current datasets:\"\n",
    "   bq ls\n",
    "fi    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "b2TuS1s9vREL"
   },
   "source": [
    "## Create the training data table\n",
    "\n",
    "Since there is already a publicly available dataset, we can simply create the training data table using this raw input data.  Note the WHERE clause in the below query:  This clause allows us to TRAIN a portion of the data (e.g. one hundred thousand rows versus one million rows), which keeps your query costs down.  If you need a refresher on using MOD() for repeatable splits see this [post](https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning).  \n",
    "\n",
    "**Note:**  The dataset in the create table code below is the one created previously, e.g. \"feat_eng\".  The table name is \"feateng_training_data\".  **Run the query to create the table**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "CMNRractvREL"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery \n",
    "\n",
    "CREATE OR REPLACE TABLE\n",
    "  feat_eng.feateng_training_data AS\n",
    "SELECT\n",
    "  (tolls_amount + fare_amount) AS fare_amount,\n",
    "  passenger_count*1.0 AS passengers,\n",
    "  pickup_datetime,\n",
    "  pickup_longitude AS pickuplon,\n",
    "  pickup_latitude AS pickuplat,\n",
    "  dropoff_longitude AS dropofflon,\n",
    "  dropoff_latitude AS dropofflat\n",
    "FROM\n",
    "  `nyc-tlc.yellow.trips`\n",
    "WHERE\n",
    "  MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))), 10000) = 1\n",
    "  AND fare_amount >= 2.5\n",
    "  AND passenger_count > 0\n",
    "  AND pickup_longitude > -78\n",
    "  AND pickup_longitude < -70\n",
    "  AND dropoff_longitude > -78\n",
    "  AND dropoff_longitude < -70\n",
    "  AND pickup_latitude > 37\n",
    "  AND pickup_latitude < 45\n",
    "  AND dropoff_latitude > 37\n",
    "  AND dropoff_latitude < 45"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "clnaaqQsXkwC"
   },
   "source": [
    "## Verify table creation\n",
    "\n",
    "Verify that you created the dataset.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fare_amount</th>\n",
       "      <th>passengers</th>\n",
       "      <th>pickup_datetime</th>\n",
       "      <th>pickuplon</th>\n",
       "      <th>pickuplat</th>\n",
       "      <th>dropofflon</th>\n",
       "      <th>dropofflat</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [fare_amount, passengers, pickup_datetime, pickuplon, pickuplat, dropofflon, dropofflat]\n",
       "Index: []"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "\n",
    "# LIMIT 0 is a free query; this allows us to check that the table exists.\n",
    "SELECT\n",
    "*\n",
    "FROM\n",
    "  feat_eng.feateng_training_data\n",
    "LIMIT\n",
    "  0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "RhgXan8wvREN"
   },
   "source": [
    "### Baseline Model:  Create the baseline model\n",
    "\n",
    "Next, you create a linear regression baseline model with no feature engineering.  Recall that a model in BigQuery ML represents what an ML system has learned from the training data.  A baseline model is a solution to a problem without applying any machine learning techniques.  \n",
    "\n",
    "When creating a BQML model, you must specify the model type (in our case linear regression) and the input label (fare_amount).  Note also that we are using the training data table as the data source."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "kb_5NlfU7oyT"
   },
   "source": [
    "Now we create the SQL statement to create the baseline model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "\n",
    "CREATE OR REPLACE MODEL\n",
    "  feat_eng.baseline_model OPTIONS (model_type='linear_reg',\n",
    "    input_label_cols=['fare_amount']) AS\n",
    "SELECT\n",
    "  fare_amount,\n",
    "  passengers,\n",
    "  pickup_datetime,\n",
    "  pickuplon,\n",
    "  pickuplat,\n",
    "  dropofflon,\n",
    "  dropofflat\n",
    "FROM\n",
    "  feat_eng.feateng_training_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Tq2KYJOM9ULC"
   },
   "source": [
    "\n",
    "Note, the query takes several minutes to complete. After the first iteration is complete, your model (baseline_model) appears in the navigation panel of the BigQuery web UI. Because the query uses a CREATE MODEL statement to create a model, you do not see query results.\n",
    "\n",
    "You can observe the model as it's being trained by viewing the Model stats tab in the BigQuery web UI. As soon as the first iteration completes, the tab is updated. The stats continue to update as each iteration completes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "HO5d50Eic-X1"
   },
   "source": [
    "Once the training is done, visit the [BigQuery Cloud Console](https://console.cloud.google.com/bigquery) and look at the model that has been trained. Then, come back to this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "RSgIJqN6vREV"
   },
   "source": [
    "### Evaluate the baseline model\n",
    "Note that BigQuery automatically split the data we gave it, and trained on only a part of the data and used the rest for evaluation.  After creating your model, you evaluate the performance of the regressor using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted values against the actual data.\n",
    "\n",
    "NOTE: The results are also displayed in the [BigQuery Cloud Console](https://console.cloud.google.com/bigquery) under the **Evaluation** tab."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Review the learning and eval statistics for the baseline_model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>training_run</th>\n",
       "      <th>iteration</th>\n",
       "      <th>loss</th>\n",
       "      <th>eval_loss</th>\n",
       "      <th>learning_rate</th>\n",
       "      <th>duration_ms</th>\n",
       "      <th>rmse</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>74.43591</td>\n",
       "      <td>68.880408</td>\n",
       "      <td>None</td>\n",
       "      <td>12426</td>\n",
       "      <td>8.627625</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   training_run  iteration      loss  eval_loss learning_rate  duration_ms  \\\n",
       "0             0          0  74.43591  68.880408          None        12426   \n",
       "\n",
       "       rmse  \n",
       "0  8.627625  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "\n",
    "# Eval statistics on the held out data.\n",
    "# Here, ML.EVALUATE function is used to evaluate model metrics\n",
    "SELECT\n",
    "  *,\n",
    "  SQRT(loss) AS rmse\n",
    "FROM\n",
    "  ML.TRAINING_INFO(MODEL feat_eng.baseline_model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean_absolute_error</th>\n",
       "      <th>mean_squared_error</th>\n",
       "      <th>mean_squared_log_error</th>\n",
       "      <th>median_absolute_error</th>\n",
       "      <th>r2_score</th>\n",
       "      <th>explained_variance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.213479</td>\n",
       "      <td>68.880408</td>\n",
       "      <td>0.258098</td>\n",
       "      <td>3.794535</td>\n",
       "      <td>0.226065</td>\n",
       "      <td>0.22613</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   mean_absolute_error  mean_squared_error  mean_squared_log_error  \\\n",
       "0             5.213479           68.880408                0.258098   \n",
       "\n",
       "   median_absolute_error  r2_score  explained_variance  \n",
       "0               3.794535  0.226065             0.22613  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "\n",
    "# Here, ML.EVALUATE function is used to evaluate model metrics\n",
    "SELECT\n",
    "  *\n",
    "FROM\n",
    "  ML.EVALUATE(MODEL feat_eng.baseline_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "xJGbfYuD8a9d"
   },
   "source": [
    "**NOTE:** Because you performed a linear regression, the results include the following columns:\n",
    "\n",
    "*   mean_absolute_error\n",
    "*   mean_squared_error\n",
    "*   mean_squared_log_error\n",
    "*   median_absolute_error\n",
    "*   r2_score\n",
    "*   explained_variance\n",
    "\n",
    "**Resource** for an explanation of the [Regression Metrics](https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234).\n",
    "\n",
    "**Mean squared error** (MSE) - Measures the difference between the values our model predicted using the test set and the actual values. You can also think of it as the distance between your regression (best fit) line and the predicted values. \n",
    "\n",
    "**Root mean squared error** (RMSE) - The primary evaluation metric for this ML problem is the root mean-squared error. RMSE measures the difference between the predictions of a model, and the observed values. A large RMSE is equivalent to a large average error, so smaller values of RMSE are better. One nice property of RMSE is that the error is given in the units being measured, so you can tell very directly how incorrect the model might be on unseen data.\n",
    "\n",
    "**R2**:  An important metric in the evaluation results is the R2 score. The R2 score is a statistical measure that determines if the linear regression predictions approximate the actual data. Zero (0) indicates that the model explains none of the variability of the response data around the mean.  One (1) indicates that the model explains all the variability of the response data around the mean."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "p_21sAIR7LZw"
   },
   "source": [
    "Next, we write a SQL query to take the SQRT() of the mean squared error as your loss metric for evaluation for the benchmark_model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "cellView": "form",
    "colab": {},
    "colab_type": "code",
    "id": "8mAXRTvbvRES"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rmse</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>8.299422</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       rmse\n",
       "0  8.299422"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "#TODO 1\n",
    "\n",
    "# Here, ML.EVALUATE function is used to evaluate model metrics\n",
    "SELECT\n",
    "  SQRT(mean_squared_error) AS rmse\n",
    "FROM\n",
    "  ML.EVALUATE(MODEL feat_eng.baseline_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "nW6fnqAW8vyI"
   },
   "source": [
    "#### Model 1:  EXTRACT dayofweek from the pickup_datetime feature.\n",
    "\n",
    "* As you recall, dayofweek is an enum representing the 7 days of the week. This factory allows the enum to be obtained from the int value. The int value follows the ISO-8601 standard, from 1 (Monday) to 7 (Sunday).\n",
    "\n",
    "* If you were to extract the dayofweek from pickup_datetime using BigQuery SQL, the datatype returned would be integer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we create a model titled \"model_1\" from the benchmark model and extract out the DayofWeek."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code",
    "id": "ZQ0kT2jN-vpm"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "#TODO 2\n",
    "\n",
    "CREATE OR REPLACE MODEL\n",
    "  feat_eng.model_1 OPTIONS (model_type='linear_reg',\n",
    "    input_label_cols=['fare_amount']) AS\n",
    "SELECT\n",
    "  fare_amount,\n",
    "  passengers,\n",
    "  pickup_datetime,\n",
    "  EXTRACT(DAYOFWEEK\n",
    "  FROM\n",
    "    pickup_datetime) AS dayofweek,\n",
    "  pickuplon,\n",
    "  pickuplat,\n",
    "  dropofflon,\n",
    "  dropofflat\n",
    "FROM\n",
    "  feat_eng.feateng_training_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "T24XjIJgdLCH"
   },
   "source": [
    "Once the training is done, visit the [BigQuery Cloud Console](https://console.cloud.google.com/bigquery) and look at the model that has been trained. Then, come back to this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "SRLxpccX_Tin"
   },
   "source": [
    "Next, two distinct SQL statements show the TRAINING and EVALUATION metrics of model_1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>training_run</th>\n",
       "      <th>iteration</th>\n",
       "      <th>loss</th>\n",
       "      <th>eval_loss</th>\n",
       "      <th>learning_rate</th>\n",
       "      <th>duration_ms</th>\n",
       "      <th>rmse</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>72.440724</td>\n",
       "      <td>88.953232</td>\n",
       "      <td>None</td>\n",
       "      <td>17251</td>\n",
       "      <td>8.511212</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   training_run  iteration       loss  eval_loss learning_rate  duration_ms  \\\n",
       "0             0          0  72.440724  88.953232          None        17251   \n",
       "\n",
       "       rmse  \n",
       "0  8.511212  "
      ]
     },
     "execution_count": 104,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "\n",
    "# Here, ML.TRAINING_INFO function is used to see information about the training iterations of a model.\n",
    "SELECT\n",
    "  *,\n",
    "  SQRT(loss) AS rmse\n",
    "FROM\n",
    "  ML.TRAINING_INFO(MODEL feat_eng.model_1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean_absolute_error</th>\n",
       "      <th>mean_squared_error</th>\n",
       "      <th>mean_squared_log_error</th>\n",
       "      <th>median_absolute_error</th>\n",
       "      <th>r2_score</th>\n",
       "      <th>explained_variance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>5.287076</td>\n",
       "      <td>88.953232</td>\n",
       "      <td>0.260932</td>\n",
       "      <td>3.713439</td>\n",
       "      <td>0.08481</td>\n",
       "      <td>0.084811</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   mean_absolute_error  mean_squared_error  mean_squared_log_error  \\\n",
       "0             5.287076           88.953232                0.260932   \n",
       "\n",
       "   median_absolute_error  r2_score  explained_variance  \n",
       "0               3.713439   0.08481            0.084811  "
      ]
     },
     "execution_count": 108,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "\n",
    "# Here, ML.EVALUATE function is used to evaluate model metrics\n",
    "SELECT\n",
    "  *\n",
    "FROM\n",
    "  ML.EVALUATE(MODEL feat_eng.model_1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "wf-9FBmL_Ti_"
   },
   "source": [
    "Here we run a SQL query to take the SQRT() of the mean squared error as your loss metric for evaluation for the benchmark_model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code",
    "id": "CsVBzNef_TjC"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rmse</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>9.431502</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       rmse\n",
       "0  9.431502"
      ]
     },
     "execution_count": 117,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "\n",
    "# Here, ML.EVALUATE function is used to evaluate model metrics\n",
    "SELECT\n",
    "  SQRT(mean_squared_error) AS rmse\n",
    "FROM\n",
    "  ML.EVALUATE(MODEL feat_eng.model_1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Lw30UexH8v9P"
   },
   "source": [
    "### Model 2:  EXTRACT hourofday from the pickup_datetime feature\n",
    "\n",
    "As you recall, **pickup_datetime** is stored as a TIMESTAMP, where the Timestamp format is retrieved in the standard output format – year-month-day hour:minute:second (e.g. 2016-01-01 23:59:59).  Hourofday returns the integer number representing the hour number of the given date.\n",
    "\n",
    "Hourofday is best thought of as a discrete ordinal variable (and not a categorical feature), as the hours can be ranked (e.g. there is a natural ordering of the values).  Hourofday has an added characteristic of being cyclic, since 12am follows 11pm and precedes 1am."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we create a model titled \"model_2\" and EXTRACT the hourofday from the pickup_datetime feature to improve our model's rmse."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code",
    "id": "FHeqcYz-B9F1"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "#TODO 3a\n",
    "\n",
    "CREATE OR REPLACE MODEL\n",
    "  feat_eng.model_2 OPTIONS (model_type='linear_reg',\n",
    "    input_label_cols=['fare_amount']) AS\n",
    "SELECT\n",
    "  fare_amount,\n",
    "  passengers,\n",
    "  #pickup_datetime,\n",
    "  EXTRACT(DAYOFWEEK\n",
    "  FROM\n",
    "    pickup_datetime) AS dayofweek,\n",
    "  EXTRACT(HOUR\n",
    "  FROM\n",
    "    pickup_datetime) AS hourofday,\n",
    "  pickuplon,\n",
    "  pickuplat,\n",
    "  dropofflon,\n",
    "  dropofflat\n",
    "FROM\n",
    "  `feat_eng.feateng_training_data`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code",
    "id": "h2yjF6uGCiZh"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean_absolute_error</th>\n",
       "      <th>mean_squared_error</th>\n",
       "      <th>mean_squared_log_error</th>\n",
       "      <th>median_absolute_error</th>\n",
       "      <th>r2_score</th>\n",
       "      <th>explained_variance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.256421</td>\n",
       "      <td>70.709518</td>\n",
       "      <td>0.262399</td>\n",
       "      <td>3.895416</td>\n",
       "      <td>0.236668</td>\n",
       "      <td>0.236768</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   mean_absolute_error  mean_squared_error  mean_squared_log_error  \\\n",
       "0             5.256421           70.709518                0.262399   \n",
       "\n",
       "   median_absolute_error  r2_score  explained_variance  \n",
       "0               3.895416  0.236668            0.236768  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "\n",
    "# Here, ML.EVALUATE function is used to evaluate model metrics\n",
    "SELECT\n",
    "  *\n",
    "FROM\n",
    "  ML.EVALUATE(MODEL feat_eng.model_2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code",
    "id": "bhfabG8XCiZm"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rmse</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>8.408895</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       rmse\n",
       "0  8.408895"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "\n",
    "# Here, ML.EVALUATE function is used to evaluate model metrics\n",
    "SELECT\n",
    "  SQRT(mean_squared_error) AS rmse\n",
    "FROM\n",
    "  ML.EVALUATE(MODEL feat_eng.model_2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "vbOSxv6BDqB-"
   },
   "source": [
    "### Model 3:  Feature cross dayofweek and hourofday using CONCAT\n",
    "\n",
    "First, let’s allow the model to learn traffic patterns by creating a new feature that combines the time of day and day of week (this is called a [feature cross](https://developers.google.com/machine-learning/crash-course/feature-crosses/video-lecture). \n",
    "\n",
    "Note:  BQML by default assumes that numbers are numeric features, and strings are categorical features.  We need to convert both the dayofweek and hourofday features to strings because the model (Neural Network) will automatically treat any integer as a numerical value rather than a categorical value.  Thus, if not cast as a string, the dayofweek feature will be interpreted as numeric values (e.g. 1,2,3,4,5,6,7) and hourofday will also be interpreted as numeric values (e.g. the day begins at midnight, 00:00, and the last minute of the day begins at 23:59 and ends at 24:00).  As such, there is no way to distinguish the \"feature cross\" of hourofday and dayofweek \"numerically\".  Casting the dayofweek and hourofday as strings ensures that each element will be treated like a label and will get its own coefficient associated with it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the SQL statement to feature cross the dayofweek and hourofday using the CONCAT function.  Name the model \"model_3\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code",
    "id": "F7l02C9KFMy7"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "#TODO 3b\n",
    "\n",
    "CREATE OR REPLACE MODEL\n",
    "  feat_eng.model_3 OPTIONS (model_type='linear_reg',\n",
    "    input_label_cols=['fare_amount']) AS\n",
    "SELECT\n",
    "  fare_amount,\n",
    "  passengers,\n",
    "  #pickup_datetime,\n",
    "  #EXTRACT(DAYOFWEEK FROM pickup_datetime) AS dayofweek,\n",
    "  #EXTRACT(HOUR FROM pickup_datetime) AS hourofday,\n",
    "  CONCAT(CAST(EXTRACT(DAYOFWEEK\n",
    "      FROM\n",
    "        pickup_datetime) AS STRING), CAST(EXTRACT(HOUR\n",
    "      FROM\n",
    "        pickup_datetime) AS STRING)) AS hourofday,\n",
    "  pickuplon,\n",
    "  pickuplat,\n",
    "  dropofflon,\n",
    "  dropofflat\n",
    "FROM\n",
    "  `feat_eng.feateng_training_data`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean_absolute_error</th>\n",
       "      <th>mean_squared_error</th>\n",
       "      <th>mean_squared_log_error</th>\n",
       "      <th>median_absolute_error</th>\n",
       "      <th>r2_score</th>\n",
       "      <th>explained_variance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.265321</td>\n",
       "      <td>69.356068</td>\n",
       "      <td>0.267477</td>\n",
       "      <td>3.887588</td>\n",
       "      <td>0.220371</td>\n",
       "      <td>0.220574</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   mean_absolute_error  mean_squared_error  mean_squared_log_error  \\\n",
       "0             5.265321           69.356068                0.267477   \n",
       "\n",
       "   median_absolute_error  r2_score  explained_variance  \n",
       "0               3.887588  0.220371            0.220574  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "\n",
    "# Here, ML.EVALUATE function is used to evaluate model metrics\n",
    "SELECT\n",
    "  *\n",
    "FROM\n",
    "  ML.EVALUATE(MODEL feat_eng.model_3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rmse</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>8.328029</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       rmse\n",
       "0  8.328029"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "\n",
    "# Here, ML.EVALUATE function is used to evaluate model metrics\n",
    "SELECT\n",
    "  SQRT(mean_squared_error) AS rmse\n",
    "FROM\n",
    "  ML.EVALUATE(MODEL feat_eng.model_3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optional:  Create a RMSE summary table to evaluate model performance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| Model          | Taxi Fare | Description                                  |\n",
    "|----------------|-----------|----------------------------------------------|\n",
    "| baseline_model | 8.62      | Baseline model - no feature engineering    |\n",
    "| model_1        | 9.43      | EXTRACT dayofweek from the pickup_datetime |\n",
    "| model_2        | 8.40      | EXTRACT hourofday from the pickup_datetime |\n",
    "| model_3        | 8.32      | FEATURE CROSS hourofday and dayofweek      |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2021 Google Inc.\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at\n",
    "http://www.apache.org/licenses/LICENSE-2.0\n",
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "10.3.2019-DRAFT-_1 - FeatEnG - LABS.ipynb",
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
